PaddyWaC: A Minimally-Supervised Web-Corpus of Hiberno-English

Authors

  • Brian Murphy
  • Egon Stemle
Abstract

Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text and distinguishing it from related languages in the same region can be difficult. For example, less dominant variants of English (e.g. New Zealander, Singaporean, Canadian, Irish, South African) may be found under their respective national domains, but will be partially mixed with Englishes of the British and US varieties, perhaps through syndication of journalism or the local reuse of text by multinational companies. Less formal dialectal usage may be scattered more widely over the internet through mechanisms such as wiki or blog authoring. Here we automatically construct a corpus of Hiberno-English (English as spoken in Ireland) using a variety of methods: filtering by national domain, filtering by orthographic conventions, and bootstrapping from a set of Ireland-specific terms (slang, place names, organisations). We evaluate the national specificity of the resulting corpora by measuring the incidence of topical terms and of several grammatical constructions that are particular to Hiberno-English. The results show that domain filtering is very effective for isolating text that is topic-specific, and that orthographic classification can exclude some non-Irish texts, but that selected seeds are necessary to extract considerable quantities of more informal, dialectal text.
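As a rough illustration of two of the filters named in the abstract (national domain and orthographic conventions), a minimal Python sketch follows. It is not the authors' implementation: the function names, the `.ie` default, the tiny spelling list, and the "score at least zero" threshold are all assumptions made for the example, and the seed-based bootstrapping step is not shown.

```python
import re
from urllib.parse import urlparse

# Hypothetical spelling pairs (British/Irish form, US form).
# Illustrative only; a real classifier would need a much larger list.
SPELLING_PAIRS = [
    ("colour", "color"),
    ("centre", "center"),
    ("organise", "organize"),
    ("favourite", "favorite"),
]


def in_national_domain(url, tld=".ie"):
    """Return True if the URL's host name ends with the given national TLD."""
    host = urlparse(url).hostname or ""
    return host.endswith(tld)


def orthography_score(text):
    """Count British/Irish spellings minus US spellings among the word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    british = sum(tokens.count(b) for b, _ in SPELLING_PAIRS)
    american = sum(tokens.count(a) for _, a in SPELLING_PAIRS)
    return british - american


def keep_document(url, text):
    """Assumed rule: keep pages under the national domain that do not lean US."""
    return in_national_domain(url) and orthography_score(text) >= 0
```

Under these assumptions, `keep_document("http://example.ie/a", "the colour of the town centre")` keeps the page, while the same text under a `.com` host, or a `.ie` page dominated by US spellings, is dropped.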

Similar resources

Minimally Supervised Method for Multilingual Paraphrase Extraction from Definition Sentences on the Web

We propose a minimally supervised method for multilingual paraphrase extraction from definition sentences on the Web. Hashimoto et al. (2011) extracted paraphrases from Japanese definition sentences on the Web, assuming that definition sentences defining the same concept tend to contain paraphrases. However, their method requires manually annotated data and is language dependent. We extend thei...

ParaMor: Minimally Supervised Induction of Paradigm Structure and Morphological Analysis

Paradigms provide an inherent organizational structure to natural language morphology. ParaMor, our minimally supervised morphology induction algorithm, retrusses the word forms of raw text corpora back onto their paradigmatic skeletons; performing on par with state-of-the-art minimally supervised morphology induction algorithms at morphological analysis of English and German. ParaMor consists o...

Minimally supervised lemmatization scheme induction through bilingual parallel corpora

We present a lemma induction scheme on a target language through minimally supervised alignment and transfer methods utilizing English-to-German parallel corpora. Compared to previous alignment and transfer approaches, the approach outlined here increases computational efficiency and significantly reduces the level of supervision necessary in inducing clusters of inflectional forms. Furthermore...

Generalisation in the Automatic Acquisition of Phonotactic Resources

Once acquired, linguistic resources for languages can be used to develop speech applications for the languages under consideration. This paper presents a fully automatic approach to the acquisition of phonotactic resources from syllable labelled data sets. While the technique requires no user intervention, the quality of acquired resources is heavily dependent on the nature and content of the s...

Comparing minimally supervised home-based and closely supervised gym-based exercise programs in weight reduction and insulin resistance after bariatric surgery: A randomized clinical trial

Background: Effectiveness of various exercise protocols in weight reduction after bariatric surgery has not been sufficiently explored in the literature. Thus, in the present study, we aimed at comparing the effect of minimally supervised home-based and closely supervised gym-based exercise programs on weight reduction and insulin resistance after bariatric surgery. ...

Publication date: 2011